feat: add jemalloc heap profiling infrastructure #449

Open
eudelins-zama wants to merge 11 commits into main from eudelins/feat/2927/memory-profiling-setup

@eudelins-zama eudelins-zama commented Mar 5, 2026

Description of changes

Add jemalloc heap profiling infrastructure to detect memory leaks in KMS core nodes.

Bumped trivy-action to avoid this CI failure.

What's included

Feature-gated jemalloc integration (heap-profiling Cargo feature):

  • Replaces the default allocator with tikv-jemallocator when the feature is enabled
  • Installs a SIGUSR1 signal handler that triggers on-demand heap dumps to /tmp/kms-heap/
  • Exposes two new Prometheus metrics (kms_jemalloc_allocated, kms_jemalloc_resident) via a jemalloc-stats feature on the observability crate, enabling leak-type diagnosis (application leak vs allocator fragmentation vs non-jemalloc growth)
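The leak-type diagnosis that these two metrics enable can be sketched as a small decision rule. This is only an illustration, not code from the PR: the function name classify_growth, the 10 MiB noise threshold, and the idea of feeding it byte deltas are all assumptions. The three inputs stand for the growth, between two observations, of kms_jemalloc_allocated, kms_jemalloc_resident, and the process RSS as reported by the OS.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify memory growth from three deltas (in bytes)
# measured between two points in time.
#   $1 = growth of kms_jemalloc_allocated
#   $2 = growth of kms_jemalloc_resident
#   $3 = growth of the process RSS
classify_growth() {
  local d_alloc=$1 d_res=$2 d_rss=$3
  local threshold=$((10 * 1024 * 1024))   # assumed noise floor: 10 MiB

  if [ "$d_alloc" -gt "$threshold" ]; then
    echo "application leak"          # live allocations keep growing
  elif [ "$d_res" -gt "$threshold" ]; then
    echo "allocator fragmentation"   # jemalloc holds pages beyond live data
  elif [ "$d_rss" -gt "$threshold" ]; then
    echo "non-jemalloc growth"       # memory growing outside the allocator
  else
    echo "stable"
  fi
}

classify_growth 52428800 53000000 54000000   # allocated grew by 50 MiB
classify_growth 1048576 41943040 42000000    # only resident grew
classify_growth 0 0 31457280                 # only RSS grew
```

The order of the checks matters: growth in allocated bytes is tested first because it also drags resident memory and RSS up with it.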

Dedicated Cargo profile (heap-profiling):

  • Inherits from release with debug=1 (line tables only) and strip=none so jeprof can resolve addresses to function:line without the full DWARF overhead

Docker Compose overlay (profiling/docker-compose-heap-profiling.yml):

  • Overrides build args to enable the feature, profile, and -C force-frame-pointers=yes for reliable backtraces
  • Configures MALLOC_CONF with prof:true, lg_prof_sample:12, prof_gdump:true, and prof_final:true
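As a rough illustration of what these options mean, the sketch below assembles the four settings named above into a MALLOC_CONF string and derives the sampling interval from lg_prof_sample, which jemalloc interprets as a base-2 logarithm of bytes of allocation activity. The actual overlay may set additional options; only the four listed here are shown, and the variable name lg is just a local helper.

```shell
# Options listed in the PR description, joined the way jemalloc expects
# (comma-separated key:value pairs).
MALLOC_CONF="prof:true,lg_prof_sample:12,prof_gdump:true,prof_final:true"
export MALLOC_CONF

# lg_prof_sample is a base-2 logarithm: extract it and print the average
# number of allocated bytes between two samples.
lg=${MALLOC_CONF#*lg_prof_sample:}   # drop everything up to the value
lg=${lg%%,*}                         # drop everything after it
echo "sampling interval: $((1 << lg)) bytes"
```

With lg_prof_sample:12 this prints a 4096-byte interval, so on average one allocation is sampled per 4 KiB allocated.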

Makefile targets:

  • build-compose-heap-profiling / start-compose-heap-profiling / stop-compose-heap-profiling
  • dump-heap-profiles: sends SIGUSR1 to all 4 cores, copies .heap files, the binary, and /proc/PID/maps locally

Analysis script (profiling/analyze-heap.sh):

  • Cross-platform (Linux + macOS) script that handles PIE/ASLR address resolution by rewriting MAPPED_LIBRARIES paths and injecting maps.txt. Not tested on macOS though.
  • Produces 4 outputs: top-leaks.txt, latest.svg, diff-leaks.txt (allocation sites that grew), and diff.svg (diff flamegraph)
  • Falls back to manual addr2line resolution when jeprof can't resolve symbols

Dockerfile changes:

  • Adds CARGO_EXTRA_FEATURES and RUSTFLAGS build args so the profiling stack can inject feature flags and frame pointers
  • Uses profile-keyed Docker cache IDs (id=cargo-target-${LTO_RELEASE}) to avoid cache collisions between release and heap-profiling builds

Config fix (separate commit):

  • Adds missing reshare field to all compose config files

Issue ticket number and link

Closes https://github.com/zama-ai/kms-internal/issues/2927

PR Checklist

I attest that all checked items are satisfied. Any deviation is clearly justified above.

  • Title follows conventional commits (e.g. chore: ...).
  • Tests added for every new pub item and test coverage has not decreased.
  • Public APIs and non-obvious logic documented; unfinished work marked as TODO(#issue).
  • unwrap/expect/panic only in tests or for invariant bugs (documented if present).
  • No dependency version changes OR (if changed) only minimal required fixes.
    • Only optional dependencies for the heap-profiling profile
  • No architectural protocol changes OR linked spec PR/issue provided.
  • No breaking deployment config changes OR devops label + infra notified + infra-team reviewer assigned.
  • No breaking gRPC / serialized data changes OR commit marked with ! and affected teams notified.
  • No modifications to existing versionized structs OR backward compatibility tests updated.
  • No critical business logic / crypto changes OR ≥2 reviewers assigned.
  • No new sensitive data fields added OR Zeroize + ZeroizeOnDrop implemented.
  • No new public storage data OR data is verifiable (signature / digest).
  • No unsafe; if unavoidable: minimal, justified, documented, and test/fuzz covered.
    • Unsafe is used, but only in the heap-profiling profile (not in the release profile)
  • Strongly typed boundaries: typed inputs validated at the edge; no untyped values or errors cross modules.
  • Self-review completed.
    • Reviewed everything carefully except profiling/analyze-heap.sh, which I only skimmed; it looks good and it generates meaningful dump comparisons.

Dependency Update Questionnaire (only if deps changed or added)

Answer in the Cargo.toml next to the dependency (or here if updating):

  1. Ownership changes or suspicious concentration?
  2. Low popularity?
  3. Unusual version jump?
  4. Lacking documentation?
  5. Missing CI?
  6. No security / disclosure policy?
  7. Significant size increase?

More details and explanations for the checklist and dependency updates can be found in CONTRIBUTING.md

@cla-bot cla-bot bot added the cla-signed label Mar 5, 2026
@eudelins-zama
Contributor Author

Usage

# Build and start env
make build-compose-heap-profiling
make start-compose-heap-profiling

# Keygen
cargo run --bin kms-core-client -- -f core-client/config/client_local_threshold.toml -l -a insecure-key-gen
# Copy the key ID from the keygen output and export it as KEY_ID for the commands below

# First decryption burst to have representative memory usage after some use
cargo run --bin kms-core-client -- -f core-client/config/client_local_threshold.toml -l -a public-decrypt from-args --to-encrypt 0xC2 --data-type euint64 -b 1 -n 2000 --key-id $KEY_ID --inter-request-delay-ms 500 -p 40

# First memory dump
make dump-heap-profiles

# Second decryption burst
cargo run --bin kms-core-client -- -f core-client/config/client_local_threshold.toml -l -a public-decrypt from-args --to-encrypt 0xC2 --data-type euint64 -b 1 -n 2000 --key-id $KEY_ID --inter-request-delay-ms 500 -p 40

# Second memory dump
make dump-heap-profiles

# Compare memory usage for core-1 before/after the second burst
./profiling/analyze-heap.sh ./profiling/heap-dumps/kms-server ./profiling/heap-dumps/core-1/

# Analyze the `./profiling/heap-analysis/diff-leaks.txt` file (Claude is pretty good at this)

@github-actions

github-actions bot commented Mar 5, 2026

Consolidated Tests Results 2026-03-08 - 20:15:43

Test Results

15 passed

Details

Tests: 15
Duration: not captured
Tool: junit-to-ctrf
Build: build-and-test → test-reporter #749
Pull request: feat: add jemalloc heap profiling infrastructure #449

test-reporter: Run #749

| Tests 📝 | Passed ✅ | Failed ❌ | Skipped ⏭️ | Pending ⏳ | Other ❓ | Flaky 🍂 | Duration ⏱️ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 15 | 15 | 0 | 0 | 0 | 0 | 0 | not captured |

🎉 All tests passed!

Tests

| Test Name | Status | Flaky | Duration |
| --- | --- | --- | --- |
| nightly_full_gen_tests_k8s_default_threshld_sequential_crs | ✅ | | 33.0s |
| test_k8s_threshld_insecure | ✅ | | 3m 19s |
| k8s_test_crs_uniqueness | ✅ | | 33.0s |
| k8s_test_insecure_keygen_encrypt_and_public_decrypt | ✅ | | 3m 17s |
| k8s_test_insecure_keygen_encrypt_multiple_types | ✅ | | 3m 37s |
| k8s_test_keygen_and_crs | ✅ | | 3m 14s |
| k8s_test_keygen_uniqueness | ✅ | | 8m 52s |
| nightly_full_gen_tests_k8s_default_centralzd_sequential_crs | ✅ | | 1.8s |
| test_k8s_centralzd_insecure | ✅ | | 5m 2s |
| k8s_test_centralized_insecure | ✅ | | 1m |
| nightly_full_gen_tests_default_k8s_centralized_sequential_crs | ✅ | | 1.7s |
| nightly_full_gen_tests_k8s_default_centralzd_sequential_crs | ✅ | | 1.7s |
| test_k8s_centralzd_insecure | ✅ | | 1m 2s |
| k8s_test_centralized_insecure | ✅ | | 1m 1s |
| nightly_full_gen_tests_default_k8s_centralized_sequential_crs | ✅ | | 1.7s |

🍂 No flaky tests in this run.

Github Test Reporter by CTRF 💚

🔄 This comment has been updated

@eudelins-zama eudelins-zama self-assigned this Mar 5, 2026
@github-actions

github-actions bot commented Mar 5, 2026

Vulnerability Scan Results

Details

Report Summary

┌───────────────────────────────────┬────────────┬─────────────────┬─────────┐
│              Target               │    Type    │ Vulnerabilities │ Secrets │
├───────────────────────────────────┼────────────┼─────────────────┼─────────┤
│ base:latest (chainguard 20230214) │ chainguard │        0        │    -    │
├───────────────────────────────────┼────────────┼─────────────────┼─────────┤
│ usr/bin/yq                        │  gobinary  │        3        │    -    │
└───────────────────────────────────┴────────────┴─────────────────┴─────────┘
Legend:
- '-': Not scanned
- '0': Clean (no security findings detected)


For OSS Maintainers: VEX Notice
--------------------------------
If you're an OSS maintainer and Trivy has detected vulnerabilities in your project that you believe are not actually exploitable, consider issuing a VEX (Vulnerability Exploitability eXchange) statement.
VEX allows you to communicate the actual status of vulnerabilities in your project, improving security transparency and reducing false positives for your users.
Learn more and start using VEX: https://trivy.dev/docs/v0.69/guide/supply-chain/vex/repo#publishing-vex-documents

To disable this notice, set the TRIVY_DISABLE_VEX_NOTICE environment variable.


usr/bin/yq (gobinary)
=====================
Total: 3 (HIGH: 2, CRITICAL: 1)

┌─────────┬────────────────┬──────────┬────────┬───────────────────┬──────────────────────────────┬──────────────────────────────────────────────────────────────┐
│ Library │ Vulnerability  │ Severity │ Status │ Installed Version │        Fixed Version         │                            Title                             │
├─────────┼────────────────┼──────────┼────────┼───────────────────┼──────────────────────────────┼──────────────────────────────────────────────────────────────┤
│ stdlib  │ CVE-2025-68121 │ CRITICAL │ fixed  │ v1.25.5           │ 1.24.13, 1.25.7, 1.26.0-rc.3 │ crypto/tls: Unexpected session resumption in crypto/tls      │
│         │                │          │        │                   │                              │ https://avd.aquasec.com/nvd/cve-2025-68121                   │
│         ├────────────────┼──────────┤        │                   ├──────────────────────────────┼──────────────────────────────────────────────────────────────┤
│         │ CVE-2025-61726 │ HIGH     │        │                   │ 1.24.12, 1.25.6              │ golang: net/url: Memory exhaustion in query parameter        │
│         │                │          │        │                   │                              │ parsing in net/url                                           │
│         │                │          │        │                   │                              │ https://avd.aquasec.com/nvd/cve-2025-61726                   │
│         ├────────────────┤          │        │                   │                              ├──────────────────────────────────────────────────────────────┤
│         │ CVE-2025-61728 │          │        │                   │                              │ golang: archive/zip: Excessive CPU consumption when building │
│         │                │          │        │                   │                              │ archive index in archive/zip                                 │
│         │                │          │        │                   │                              │ https://avd.aquasec.com/nvd/cve-2025-61728                   │
└─────────┴────────────────┴──────────┴────────┴───────────────────┴──────────────────────────────┴──────────────────────────────────────────────────────────────┘

@eudelins-zama eudelins-zama marked this pull request as ready for review March 5, 2026 14:58
@eudelins-zama eudelins-zama requested a review from a team as a code owner March 5, 2026 14:58
@dvdplm dvdplm left a comment
Contributor

LGTM

fi
}

# sed -i: GNU sed uses -i '', BSD (macOS) sed requires -i ''
Contributor

This seems to imply that they are the same?

Contributor Author

Fixed in 44b0f8a (they are not the same apparently, but doc was unclear indeed)
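
For reference, the difference discussed in this thread is that GNU sed takes -i with no separate argument, while BSD (macOS) sed requires a backup suffix after -i, which may be the empty string. One common way to cope with both is a small wrapper like the sketch below. The name sed_inplace is hypothetical and this is not necessarily what analyze-heap.sh does; the detection relies on GNU sed accepting --version, which BSD sed rejects.

```shell
# Run an in-place substitution with whichever sed is installed.
# Usage: sed_inplace 's/old/new/' file
sed_inplace() {
  if sed --version >/dev/null 2>&1; then
    sed -i "$@"        # GNU sed: -i takes no separate argument
  else
    sed -i '' "$@"     # BSD/macOS sed: -i needs a (possibly empty) suffix
  fi
}

# Small demonstration on a throwaway file.
tmp=$(mktemp)
printf 'path=/old/bin\n' > "$tmp"
sed_inplace 's|/old/|/new/|' "$tmp"
cat "$tmp"
rm -f "$tmp"
```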

@kc1212
Contributor

kc1212 commented Mar 5, 2026

I've been trying this PR on macOS but I can't figure out how to run pprof. The script profiling/analyze-heap.sh tells me to install gperftools using Homebrew, but after installing it there doesn't seem to be an executable.

But this is not super surprising since I've always had problems with profilers on macOS.

If anyone has ideas please let me know; otherwise I can try compiling gperftools from scratch and see if it gives me a pprof executable.

@kc1212
Contributor

kc1212 commented Mar 5, 2026

Ok I installed pprof using golang https://github.com/google/pprof

but then I got the error

Found 1 heap dump(s)
./profiling/analyze-heap.sh: line 170: DUMPS_ORIG: bad array subscript

It seems not all bash versions support negative indexing
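
A portable way to pick the last array element, assuming the failure above does come from a negative subscript, is to compute the index from the array length. bash 3.2, the version shipped with macOS, rejects negative subscripts, while index arithmetic works everywhere. The array contents below are made-up file names, and the variable names other than DUMPS-style naming are illustrative, not taken from the script.

```shell
#!/usr/bin/env bash
# Stand-in for the list of dumps found by the script (hypothetical names).
DUMPS=(dump.0.heap dump.1.heap dump.2.heap)

count=${#DUMPS[@]}
if [ "$count" -gt 0 ]; then
  # Index arithmetic works on bash 3.2, where ${DUMPS[-1]} is an error.
  latest=${DUMPS[$((count - 1))]}
  echo "latest: $latest"
fi

# A diff needs a second, older dump; guard instead of indexing blindly.
if [ "$count" -ge 2 ]; then
  first=${DUMPS[0]}
  echo "diff base: $first"
fi
```

The explicit length checks also cover the single-dump case, where there is nothing to diff against.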

@eudelins-zama
Contributor Author

@kc1212 pprof != jeprof AFAIU

@eudelins-zama
Contributor Author

Btw, feel free to push an update on macOS setup documentation if you manage to troubleshoot some issues on your side!

@kc1212
Contributor

kc1212 commented Mar 5, 2026

> @kc1212 pprof != jeprof AFAIU

well, the script is asking for jeprof, google-pprof or pprof

@eudelins-zama
Contributor Author

The docs say:

## Host dependencies

- jeprof (from gperftools) — reads jemalloc .heap dumps
- graphviz — renders SVG flamegraphs (
- addr2line (from binutils) — resolves addresses to source lines

But indeed, the script seems to accept pprof, which it should not.

@kc1212
Contributor

kc1212 commented Mar 5, 2026

> Btw, feel free to push an update on macOS setup documentation if you manage to troubleshoot some issues on your side!

pushed the changes, seems to be all ok now!

@dd23 dd23 reopened this Mar 10, 2026
@titouantanguy
Contributor

Small note that you configure jemalloc to auto-dump to /tmp/kms-heap/auto but this folder doesn't exist by default in the docker container we use for profiling (leading to jemalloc complaining).
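
One possible remedy, sketched here as an assumption rather than as the fix adopted in the PR, is to create the directory before the server starts, for example from an entrypoint script. HEAP_DUMP_DIR is a hypothetical override that only exists so the snippet can be tried outside the container; the default path is the one mentioned in the comment above.

```shell
# Directory jemalloc is configured to auto-dump into, per the comment above.
HEAP_DUMP_DIR="${HEAP_DUMP_DIR:-/tmp/kms-heap/auto}"

# -p creates missing parents and succeeds if the directory already exists.
mkdir -p "$HEAP_DUMP_DIR"

if [ -d "$HEAP_DUMP_DIR" ] && [ -w "$HEAP_DUMP_DIR" ]; then
  echo "heap dump directory ready: $HEAP_DUMP_DIR"
fi
```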

Labels

cla-signed The CLA has been signed.

5 participants